ABSTRACT

A detailed investigation of the statistical procedure 
of W. Stout (the computer program DIMTEST) for testing the hypothesis 
that an essentially unidimensional latent trait model fits observed 
binary response data from a psychological test is presented. One 
finding is that DIMTEST may fail to perform as desired in the 
presence of guessing when coupled with many high-discriminating test 
items. A revision of DIMTEST is proposed to overcome this limitation. 
Also, an automatic approach is devised to determine the size of the 
assessment subtests. Further, an adjustment is made on the estimated 
standard error of the statistic on which DIMTEST depends. These three 
refinements result in an improved procedure that is shown in 
simulation studies to adhere closely to the nominal level of 
significance while achieving considerably greater power. Finally, 
DIMTEST is validated on real data sets from the Armed Services 
Vocational Aptitude Battery (1,984 and 1,961 examinees) and the 
American College Testing program (2,491 and 2,494 examinees). Seven 
tables present analysis results, and 46 references are included.
(Author/SLD) 
Refinements of Stout's Procedure for Assessing
Latent Trait Unidimensionality



Abstract 

This paper provides a detailed investigation of Stout's statistical procedure (the 
computer program DIMTEST) for testing the hypothesis that an essentially 
unidimensional latent trait model fits observed binary item response data from a 
psychological test. One finding was that DIMTEST may fail to perform as desired in the 
presence of guessing when coupled with many high-discriminating items. A revision of
DIMTEST is proposed to overcome this limitation. Also, an automatic approach is devised
to determine the size of the assessment subtests. Further, an adjustment is made on the
estimated standard error of the statistic on which DIMTEST depends. These three 
refinements have led to an improved procedure that is shown in simulation studies to 
adhere closely to the nominal level of significance while achieving considerably greater 
power. Finally, DIMTEST is validated on a selection of real data sets. 

Subject terms: Unidimensionality, essential independence, essential unidimensionality,
DIMTEST, item response theory. 



Item response theory (IRT) is presently one of the most widely used techniques in 
psychometrics and is likely to remain so in the future. Some applications of IRT include
ability estimation, item/test bias, equating, and adaptive testing. The three assumptions 
underlying many commonly used IRT models are monotonicity, unidimensionality ($d = 1$),
and local independence (LI). Monotonicity assumes that the probability of correctly 
answering an item increases as ability increases. Unidimensionality assumes that the items
of a test measure a single ability. Local independence assumes that given any particular 
level of ability, responses to different items are independent. This paper is concerned with 
the statistical assessment of the assumption of unidimensionality. Most IRT models 
specifically require this assumption; moreover, classical test theory models implicitly 
assume that items measure the same dominant dimension. In spite of the importance of
this assumption, it is also well known that actual data are rarely strictly unidimensional. It 
has long been argued that items are multiply determined and that, in addition to 
measuring the intended attribute, other attributes unique to individual items or common to 
relatively few items are unavoidable (Humphreys, 1981, 1985, 1986; Hambleton & 
Swaminathan, 1985; Reckase, 1979, 1985; Stout, 1987; Traub, 1983; Yen, 1985). In addition 
to the multiple item attributes that influence dimensionality, examinee characteristics such 
as differential teaching methods, the point of time during the instructional unit that the 
test is given, and so forth, also can influence the dimensionality of a set of items 
(Birenbaum & Tatsuoka, 1982; Bejar, 1983; Traub, 1983). Dimensionality is therefore a 
property of both the test and the examinee population taking the test (Reckase, 1990). 

Linear factor analysis (subjectively interpreted in the absence of a statistical 
distribution theory) has been the traditional approach for assessing the dimensionality of a 
set of items. If the results of a linear factor analysis reveal only one significant factor, then
the test is considered unidimensional. In the case of dichotomous data, however, it is well 
known that linear factor analysis of phi correlations between items often leads to 
overestimation of the number of factors underlying the item responses (Carroll, 1945; 
Hambleton & Swaminathan, 1985, Chapter 2; Hulin, Drasgow, & Parsons, 1983, Chapter 8;
McDonald & Ahlawat, 1974). As a corrective alternative, tetrachoric correlations can be
used for factor analysis. When guessing is present in the responses to items, however, linear 
factor analysis of tetrachoric correlations can produce a spurious factor due to difficulty of 
test items (Hulin, Drasgow, & Parsons, 1983, Chapter 8). In addition, computation of 
tetrachoric correlations can be problematic if any one of the correct/incorrect cells of the
two-by-two item response tables contains a zero. Matrices of simple tetrachoric
correlations are thus often non-Gramian. As a result, conventional methods of factor
analysis by phi or tetrachoric correlations are often unsatisfactory for assessing the 
dimensionality of test items. Christoffersson (1975) and Muthen (1978) have developed 
generalized least squares methods to overcome the problems with factor analysis of 
tetrachoric correlations, but their methods are limited to 25 items at most. Moreover, they 
are computationally intensive. 

In recent years a vast body of literature has been developed for assessing the 
dimensionality of test items. Comprehensive reviews of different procedures for assessing 
dimensionality are provided by Hattie (1984, 1985), and Hulin, Drasgow, and Parsons 
(1983, Chapter 8). Some of the more recent procedures developed to assess latent trait 
dimensionality include: maximum likelihood full information factor analysis (Bock, 
Gibbons, & Muraki, 1985); the Tucker and Humphreys procedures based on local 
independence and first and second factor loadings (Roznowski, Tucker, & Humphreys, 
1991); Stout's (1987) procedure for assessing unidimensionality based on the theory of 
essential independence; modified parallel analysis, which combines latent trait methods and 
factor analysis and uses eigenvalues of the tetrachoric correlation matrix (Hulin, Drasgow, &
Parsons, 1983, p. 255); McDonald's nonlinear factor analysis (McDonald, 1962; McDonald
& Ahlawat, 1974; Etezadi-Amoli & McDonald, 1983); Holland and Rosenbaum's test of
unidimensionality, monotonicity, and conditional independence (Holland, 1981; Holland &
Rosenbaum, 1986); residual analysis, determined by model-data fit (Hambleton &
Swaminathan, 1985, Chapter 8); and Bejar's procedure based on three-parameter logistic
item parameter estimates (Bejar, 1980). Some of these methods are reviewed in Hambleton
and Rovinelli (1986), Mislevy (1986), and Zwick (1987a). 

Although these different approaches offer promise for assessing the dimensionality of 
binary data, researchers in the field have not reached a consensus on one satisfactory 
method (Berger & Knol, 1990; Hambleton & Rovinelli, 1986; Hattie, 1984; Zwick, 1987b). 
Primarily, this is due to the fact that there is substantial confusion in the literature 
concerning the definition of unidimensionality. Additionally, many existing methods for 
assessing dimensionality are only loosely connected to the various definitions in the 
literature (Hambleton & Rovinelli, 1986). 

This article is concerned with Stout's procedure for assessing unidimensionality 
(DIMTEST). Stout (1987) has developed a nonparametric statistical procedure based on 
the large sample distribution theory for assessing latent trait dimensionality and has 
argued the validity of this procedure based upon simulation studies involving a variety of 
achievement tests. DIMTEST has been shown to discriminate well between one- and
two-dimensional tests, maintaining good adherence to a specified level of significance when
$d = 1$ and maintaining good power when $d = 2$, even when the correlation between the
abilities is as high as .7.

The present study provides a detailed investigation of certain performance 
characteristics and the consequent major refinements of DIMTEST for assessing latent 
trait dimensionality. DIMTEST was found to perform undesirably in certain cases where
the test contained many highly discriminating items with guessing present. A correction is
proposed to overcome this limitation. In addition, an automatic approach is devised for
determining $M$, the size of the assessment subtests; a better control of $\alpha$, the specified level
of significance, is achieved by adjusting the estimated standard error of Stout's statistic $T$.
These refinements have led to an improved test procedure that is easier to use and has been
shown in simulation studies to adhere closely to the nominal level of significance while 
achieving considerably greater power. Finally, the procedure is applied to a selection of real 
data sets. 

Stout's Procedure for Assessing Unidimensionality 

As stated in the beginning of this paper, items are multiply determined, and thus 
the number of dominant abilities should be assessed in testing for dimensionality. Stout 
first informally (1987) and then formally (1990) provided a definition of the number of 
dominant dimensions known as essential dimensionality, which is derived from the theory 
of essential independence. The statistical procedure for assessing essential unidimensionality
is consistent with the definition of essential dimensionality. To assist the reader in
evaluating this claim as well as to help the reader understand the refinements made
to DIMTEST, Stout's definition of essential dimensionality will be followed by a brief
summary of the statistical procedure. The reader is advised, however, that use of
DIMTEST does not require acceptance of Stout's notion of essential dimensionality, and,
in fact, DIMTEST can also be viewed as a technique to detect sizable lack of fit of a locally
independent unidimensional latent trait model. 

Let $U_i$ denote the $i$-th item response and $\mathbf{U}_N = (U_1, U_2, \ldots, U_N)$ denote the test
response vector for an $N$-item test. Observed item and test values will be denoted by
$u_i$ and $\mathbf{u}_N = (u_1, u_2, \ldots, u_N)$, respectively. Let $U_i = 1$ denote a correct response and $U_i = 0$
denote an incorrect response to item $i$ for a randomly chosen examinee. The latent random
vector is denoted by $\Theta$ and the particular values it takes are denoted by $\theta$. Let $P_i(\theta)$ denote
the probability that a randomly chosen examinee with ability $\theta$ will get the $i$-th item
correct. It is assumed that all item response functions $P_i(\theta)$ are monotone. Let $\mathbf{U} = \{U_i,
i \ge 1\}$ denote the item pool consisting of $\mathbf{U}_N$ as its first $N$ items. The item pool is
conceptualized as a result of continuing the test construction process in the same manner
beyond the construction of the $N$ items that make up the actual test $\mathbf{U}_N$ being studied.
One advantage of using only the partially observed $\mathbf{U}$ instead of the actually observed $\mathbf{U}_N$
to model the test is that a totally rigorous definition of the number of dominant dimensions
can be given. These ideas are carefully and formally developed in Stout (1990) and
constitute a large sample approach to test modeling.

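Definition 1 (Stout, 1990). The item pool $\mathbf{U}$ is said to be essentially independent (EI)
with respect to the latent variable $\Theta$ if $\mathbf{U}$ satisfies

$$\frac{1}{\binom{N}{2}} \sum_{1 \le i < j \le N} \left|\mathrm{Cov}(U_i, U_j \mid \Theta = \theta)\right| \to 0 \quad \text{as } N \to \infty$$

for every $\theta$.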

The distinction between local independence and essential independence is that local
independence requires $\mathrm{Cov}(U_i, U_j \mid \Theta = \theta) = 0$ for all $\theta$, whereas essential independence
requires the average value of $|\mathrm{Cov}(U_i, U_j \mid \Theta = \theta)|$ over all item pairs to be small in
magnitude for all $\theta$ as the test length increases. Hence, essential independence is a weaker
assumption than local independence.
Definition 2 (Stout, 1990). The essential dimensionality ($d_E$) of an item pool $\mathbf{U}$ is
the minimal dimensionality (number of elements in $\theta$) necessary to satisfy the assumption
of essential independence. When $d_E = 1$, essential unidimensionality is said to hold.

The reader should note that $d_E = 1$ means that $\mathbf{U}$ has an IRT model for which
essential independence holds for a unidimensional latent trait $\theta$. The ordinary definition of
IRT dimensionality is the same as Definition 2, but essential independence is replaced by
local independence and $\mathbf{U}$ is replaced by $\mathbf{U}_N$. Stout argues that the assumptions concerning
local independence and the resulting ordinary IRT definition of dimensionality should often 
be replaced by the respective weaker assumptions concerning essential independence and 
essential dimensionality. Junker (1988, 1991) has proved results concerning essential 
independence and, in particular, has derived statistical consistency results for maximum 
likelihood estimates of ability under the assumption of essential unidimensionality. 

It can now be clearly stated what assessing the hypothesis of essential
unidimensionality means: among all the essentially independent monotone IRT models for
$\mathbf{U}$, does there exist a unidimensional one? To answer this question, we assume both
monotonicity and essential independence and assess the lack of fit of unidimensionality.
This approach is similar to most other procedures for assessing data dimensionality, with 
the exception that essential rather than local independence is assumed. 

The statistical procedure for testing the null hypothesis of essential 
unidimensionality will be briefly described here. For further details see Stout (1987, Sec. 4). 
The $N$ test items are split into two assessment subtests of length $M$ each, called the
Assessment 1 subtest (AT1) and the Assessment 2 subtest (AT2), and a longer subtest
called the partitioning subtest (PT) of length $n$ ($= N - 2M$). The $M$ items for subtest AT1
are selected to have the same dominant trait. This splitting can be done using either expert
opinion or exploratory factor analysis. Whatever method is used to select items of AT1, the
goal is to select a small subset of items (up to one-fourth of the total test length seems a
good convention) that all measure the same dominant trait and, at the same time, are as
dimensionally different as possible from the PT items. Once items for AT1 are selected, a
second set of $M$ items is selected for AT2 from the remaining items so that AT2 items have
a difficulty distribution similar to AT1 items (Step 6, Stout, 1987). The remaining $n$ ($=
N - 2M$) items then become the partitioning subtest PT.
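The three-way split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the AT1 indices are taken as given, item difficulty is represented by a hypothetical list of proportions correct, and AT2 is chosen by a simple greedy nearest-difficulty match, which stands in for (and is not claimed to reproduce) Step 6 of Stout (1987). The function name `split_items` is invented for this sketch.

```python
def split_items(at1, difficulty):
    """Split item indices into AT2 and PT, given AT1.

    at1        -- indices already chosen for AT1 (assumed given)
    difficulty -- proportion correct per item (hypothetical difficulty measure)
    """
    at1_set = set(at1)
    remaining = [i for i in range(len(difficulty)) if i not in at1_set]
    at2 = []
    for i in at1:
        # Greedy nearest-difficulty match: an assumed stand-in for Stout's Step 6.
        match = min(remaining, key=lambda r: abs(difficulty[r] - difficulty[i]))
        at2.append(match)
        remaining.remove(match)
    return at2, remaining  # the remaining n = N - 2M items form PT


# Hypothetical 12-item test with two easy items placed in AT1.
difficulty = [0.90, 0.80, 0.85, 0.50, 0.40, 0.30, 0.82, 0.88, 0.60, 0.20, 0.55, 0.70]
at1 = [0, 1]
at2, pt = split_items(at1, difficulty)
```

Here PT always ends up with $N - 2M$ items, as in the text; how closely the AT2 difficulty distribution tracks AT1 depends on the matching rule, which is the part this sketch simplifies.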

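Each examinee is assigned to one of $K$ subgroups according to his or her score on PT.
After eliminating subgroups with too few examinees ($J_{\min} = 20$ recommended), within each
subgroup $k$, two variance estimates, the usual variance estimate $\hat\sigma_k^2$ and the
"unidimensional" variance estimate $\hat\sigma_{U,k}^2$, are computed using items of AT1:

$$\hat\sigma_k^2 = \frac{1}{J_k}\sum_{j=1}^{J_k}\left(Y_j^{(k)} - \bar{Y}^{(k)}\right)^2,$$

where

$$Y_j^{(k)} = \frac{1}{M}\sum_{i=1}^{M} U_{ijk}, \qquad \bar{Y}^{(k)} = \frac{1}{J_k}\sum_{j=1}^{J_k} Y_j^{(k)},$$

with $U_{ijk}$ denoting the response of the $j$th examinee to the $i$th item from the $k$th subgroup
and $J_k$ denoting the number of examinees in the $k$th subgroup, and

$$\hat\sigma_{U,k}^2 = \frac{1}{M^2}\sum_{i=1}^{M}\hat{p}_i^{(k)}\left(1 - \hat{p}_i^{(k)}\right),$$

where

$$\hat{p}_i^{(k)} = \frac{1}{J_k}\sum_{j=1}^{J_k} U_{ijk}.$$

The difference in these variance estimates is then normalized by an appropriate
normalizing constant $S_k$ and summed over subgroups to arrive at the statistic

$$T_L = \frac{1}{\sqrt{K}}\sum_{k=1}^{K}\frac{\hat\sigma_k^2 - \hat\sigma_{U,k}^2}{S_k}. \qquad (2)$$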



Afc, (3) 



where 
and 

Similarly, using items of AT2, the two variance estimates $\hat\sigma_k^2$, $\hat\sigma_{U,k}^2$ and the
standard error of estimate $S_k$ are computed and normalized within each subgroup to arrive
at the statistic $T_B$ using formula (2). The statistic $T$ to assess departure from essential
unidimensionality is given by

$$T = (T_L - T_B)/\sqrt{2}. \qquad (4)$$

The null hypothesis of $d_E = 1$ is rejected if $T > Z_\alpha$, where $Z_\alpha$ is the upper $100(1 - \alpha)$
percentile of the standard normal distribution, $\alpha$ being the desired level of significance.
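The computation described above can be sketched as follows. This is a minimal illustration, not the DIMTEST program: it assumes proportion-correct subtest scores with divisor $J_k$ in the usual variance estimate, and it takes the normalizing constants $S_k$ as inputs rather than computing them, since their formula (3) is not reproduced here. All function names are invented for the sketch.

```python
import math
from statistics import NormalDist


def variance_estimates(responses):
    """Usual and 'unidimensional' variance estimates for one subgroup.

    responses -- list of examinee rows (0/1) on the M assessment items.
    Assumes proportion-correct scores and divisor J (not J - 1).
    """
    J = len(responses)
    M = len(responses[0])
    scores = [sum(row) / M for row in responses]            # Y_j
    mean = sum(scores) / J
    usual = sum((y - mean) ** 2 for y in scores) / J
    p = [sum(row[i] for row in responses) / J for i in range(M)]
    unidim = sum(pi * (1 - pi) for pi in p) / M ** 2
    return usual, unidim


def normalized_sum(x, s):
    """(1 / sqrt(K)) * sum_k x_k / s_k, with s_k supplied by the caller."""
    return sum(xk / sk for xk, sk in zip(x, s)) / math.sqrt(len(x))


def dimtest_decision(t_l, t_b, alpha=0.05):
    """Return T = (T_L - T_B) / sqrt(2) and whether T exceeds Z_alpha."""
    t = (t_l - t_b) / math.sqrt(2)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return t, t > z_alpha


# Two perfectly dependent items: the usual variance exceeds the unidimensional one.
usual, unidim = variance_estimates([[1, 1], [0, 0], [1, 1], [0, 0]])
t, reject = dimtest_decision(t_l=3.0, t_b=0.5)
```

In the toy subgroup the two items always agree, so the usual estimate (0.25) is twice the unidimensional estimate (0.125); with independent items the two would tend to agree.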

Correction for Bias in the Statistic $T_L$ by Introduction of $T_B$



Consider the statistical bias that would result if $T_L$ rather than $T$ were the statistic
used to assess essential dimensionality. The above description shows that Stout's test is
based on two variance estimates: the usual variance estimate $\hat\sigma_k^2$ and the unidimensional
variance estimate $\hat\sigma_{U,k}^2$. If the items of the test measure one dominant trait, then the two
subtests AT1 and PT would contain essentially unidimensional items representing the same
dominant trait. When the test is both long and essentially unidimensional,
examinees within each subgroup can be assumed to be of approximately equal ability.
Consequently, it can be shown that the differences in the variance estimates $\hat\sigma_k^2 - \hat\sigma_{U,k}^2$,
computed using items in AT1, would be "small"; thus, using $T_L$, the test will be assessed
as essentially unidimensional. By contrast, if the test is long and essentially
multidimensional, the trait measured by items of AT1 would be different from the trait(s)
measured by the rest of the test, and the AT1 differences $\hat\sigma_k^2 - \hat\sigma_{U,k}^2$ would not be small (see
Stout, 1987, for the heuristics explaining why this holds), and $T_L$ would lead to the conclusion that the test
is essentially multidimensional.

In the case of a relatively short essentially unidimensional test, however, examinees
within each subgroup are not likely to be approximately equal on the dominant trait
measured by the test, thereby causing the differences $\hat\sigma_k^2 - \hat\sigma_{U,k}^2$ to be large. This improperly
inflates the value of the statistic $T_L$ and results in statistical bias. This bias is amplified if
items of AT1 are homogeneous with respect to item difficulty, which often occurs when
AT1 is selected by factor analysis. To correct for this pre-asymptotic bias in $T_L$, AT2 is
constructed so that items of AT1 and AT2 are closely matched in their item difficulty
distribution. It has been observed that subtests AT1 and AT2 are both subject to similar
amounts of pre-asymptotic bias, but because AT2 is chosen to be similar to AT1 in
difficulty only, $T_B$ formed from AT2 will not be made larger by the presence of
multidimensionality. Thus, as statistical experimental design ideas suggest, the bias is
cancelled by forming the difference statistic $T$ (Step 6, Stout, 1987).

Avoiding Bias due to Guessing and High Discrimination of Items 

Test items usually differ with respect to their various measurement properties. 
There may be difficult items, easy items, high discrimination items, low discrimination 
items, and so on. The SCMtem SAT-Verbal vocabulary test analyzed by Lord (1968) is no 
exception. Item parameter estimates for this test were obtained by LOGIST. DIMTEST
with a specified level of significance $\alpha = .05$ was applied to a three-parameter logistic,
unidimensional simulation model on various random subtests of 50 items selected from this
test. For 100 replications of DIMTEST on the simulated test data, 5 rejections of the
hypothesis of $d_E = 1$ were observed, strongly confirming the unidimensional nature of the
simulated data. Items of the SAT-Verbal test were then divided into two sets. One set
consisted of items having discrimination parameters greater than 1.0
(high discriminations); the other set consisted of items with discrimination parameters less
than or equal to 1.0 (low discriminations). DIMTEST was applied separately to each
subtest and the results were markedly different. Note from the classical test theory
perspective that the first test has high reliability and the second test low reliability.
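A unidimensional three-parameter logistic simulation of the kind described above can be sketched as follows. The sketch makes several assumptions that the text does not state: abilities are drawn from a standard normal distribution, the logistic scaling constant is $D = 1.7$, and the item parameters shown are invented for illustration rather than taken from the LOGIST estimates.

```python
import math
import random


def p_3pl(theta, a, b, c, D=1.7):
    """3PL item response function: c + (1 - c) / (1 + exp(-D a (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))


def simulate(n_examinees, items, seed=0):
    """Simulate 0/1 responses for a unidimensional 3PL test.

    items -- list of (a, b, c) tuples: discrimination, difficulty, guessing.
    Abilities are drawn from N(0, 1) (an assumption of this sketch).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0, 1)
        row = [1 if rng.random() < p_3pl(theta, a, b, c) else 0 for a, b, c in items]
        data.append(row)
    return data


# Hypothetical item parameters, two of them with discrimination above 1.0.
items = [(1.2, -0.5, 0.20), (0.8, 0.0, 0.20), (1.5, 1.0, 0.25)]
data = simulate(5, items, seed=1)
```

The guessing parameter $c$ is the lower asymptote: a very low ability examinee still answers correctly with probability close to $c$, which is the feature the following sections identify as a source of misclassification.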



Table 1 



Table 1 displays the performance of the procedure, for both subtests, administered 
to 750, 1000, and 2000 examinees. In these simulations, seven items were selected in each of 
the assessment subtests based on factor analysis with $J_{\min} = 20$. The reported values in
Table 1 are the number of rejections out of the 100 replications of DIMTEST. 

The number of rejections for the test with low discriminations is what is to be
expected on a unidimensional test. However, the rejection rate for the test with
high discriminations far exceeds the nominal level of 5/100. Furthermore, as the number of
examinees increases, the rejection rate also increases. 

This finding was confirmed in another unidimensional simulation, which used the
ASVAB general science test as its basis. Item parameter estimates for this test were
obtained by Mislevy and Bock (1984). In this simulation a rejection rate of 13/100 was
observed with $\alpha = .05$. Further investigation showed that this elevated rejection rate was
caused by a preponderance of difficult, highly discriminating items. Thus, there is evidence
to show that if many items of a test are both highly discriminating and difficult with
guessing present, the observed type-I error rate may be unacceptably inflated.

In an attempt to determine the cause(s) for excess bias, Monte Carlo simulations
were investigated extensively with tests of high-discriminating items. Recalling that items
for AT1 were chosen according to the magnitude of their loadings on the second extracted
factor (Step 1, Stout, 1987), it was found that in the case of high discriminations with
guessing present (with $d_E = 1$), the second factor was a very pronounced difficulty factor
even though tetrachoric correlations were used. One of the characteristics of the difficulty
factor is that very easy and very difficult items have high loadings of the opposite sign. In
the case of high discriminations, for unknown reasons, but likely due to the presence of
guessing, most often very easy items tended to have larger factor loadings in magnitude on
the second factor than the corresponding collection of very difficult items. Consequently,
the easiest items tended to be selected for AT1. To control for statistical bias, DIMTEST
then selects the easiest remaining items for AT2. Therefore, PT is left with mostly difficult
items. Because examinees are grouped according to their scores on PT, which
consists mostly of difficult items in this case, the partitioning subtest (PT) tends to
misclassify low ability examinees. This misclassification is made worse if guessing is
allowed. Thus, examinee abilities within each assigned subgroup may vary considerably,
leading to a serious violation of the fundamental assumption of essential independence
within subgroups. This assumption is critical for the statistic $T$ to adhere closely to the
nominal level of significance. As a result, the values of the statistic $T_L$ (computed from
AT1) averaged around 10, and the values of $T_B$ (computed from AT2) averaged around 7.
Thus, the values of $T = (T_L - T_B)/\sqrt{2}$ were so large that the hypothesis of $d_E = 1$ was
often rejected. Although $T_B$ is supposed to compensate for the bias in $T_L$, the bias in
$T_L$ was so large that compensation was ineffective.

It is interesting to note that there are two reasons why the SAT subtest with
low discriminations failed to exhibit statistical bias. First, low discriminations enhance
the ability of AT2 to compensate (in a statistical, experimental design balancing sense)
for the bias contributed by items of AT1. Second, the SAT subtest with
low discriminations has a wider distribution of item difficulty, thereby tending to reduce
the misclassification of examinees in the formation of subgroups.

Another unidimensional simulation study was conducted with the same 
high-discriminations SAT items, but with all c-parameters set to zero, creating a
high-discriminations 2PL model. There was 1 rejection out of 100 trials. Therefore, the
presence of guessing coupled with high discriminations seemed to have caused the inflated
rejection rate. This is true because, without guessing in the model, a highly pronounced
difficulty factor is unlikely to appear in the tetrachoric factor analysis and, in fact, did not
appear in high-discriminations 2PL simulations. Moreover, eliminating guessing reduces
the problem of misclassification of low-ability examinees. 

Based on the above findings, it was conjectured that when guessing/high
discrimination items are present, the assignment of examinees to subgroups could be done
more effectively using PT scores that were based on items that included easy items. This
was achieved in the following way. First, items of AT1 are checked statistically, using the
Wilcoxon rank sum test, to test if the items of AT1 are too easy as a group. If the
Wilcoxon rank sum test rejects, the procedure is to replace these items with items of highest
loadings of the opposite sign so that they are still dimensionally homogeneous. If the
Wilcoxon rank sum test does not reject, items of AT1 are retained. Algorithm 1 in the
Appendix describes this procedure in detail. Items of AT2 are selected, as before, so that
items in AT1 and AT2 have approximately the same difficulty distribution.
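The "too easy as a group" check can be illustrated with a small rank sum computation. This is a sketch under assumptions, not Algorithm 1: item easiness is represented by proportion correct, the comparison group is taken to be all non-AT1 items, the test is one-sided at a hypothetical level of .05, and the normal approximation is used without a tie correction. The function names are invented here.

```python
import math
from statistics import NormalDist


def midranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def at1_too_easy(p_at1, p_rest, alpha=0.05):
    """One-sided Wilcoxon rank sum test (normal approximation, no tie correction).

    p_at1, p_rest -- proportions correct for AT1 items and for the other items.
    Returns (z, reject), where reject means AT1 items rank as easier.
    """
    m, n = len(p_at1), len(p_rest)
    ranks = midranks(list(p_at1) + list(p_rest))
    w = sum(ranks[:m])                       # rank sum of the AT1 items
    mean = m * (m + n + 1) / 2
    sd = math.sqrt(m * n * (m + n + 1) / 12)
    z = (w - mean) / sd
    return z, z > NormalDist().inv_cdf(1 - alpha)


# Hypothetical case: four very easy AT1 items against eight harder items.
z_easy, reject_easy = at1_too_easy([0.95, 0.92, 0.90, 0.88],
                                   [0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.45, 0.55])
```

When the test rejects, the procedure in the text swaps AT1 for the items loading most strongly with the opposite sign; that replacement step is not shown here.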

Automating the Size $M$ of Assessment Subtests



As described previously, DIMTEST splits the $N$ items of the test into three subtests:
AT1 and AT2 of length $M$ each, and PT of length $n$ ($= N - 2M$). In all the simulation
studies presented in Stout (1987), the size of the assessment subtests $M$ was specified by
the user a priori. For example, for a 30-item test, 5 or 7 items were used in each of the
assessment subtests; for a 50-item test, 8 or 12 items were used. By contrast, our aim has
been to develop an algorithm that automatically determines the size of assessment subtests 
according to the magnitude of item loadings on the second extracted factor. For most 
applications this would seem preferable to the selection of M, a priori, especially by a 
novice user. 

According to Stout's large sample theory for DIMTEST, $M$ should be small
compared to $N$. Extensive Monte Carlo simulations showed that a minimum of four items
was needed in each of the assessment subtests in order to have reliable variance estimates
(Nandakumar, 1987; Stout, 1984, p. 31). To determine the maximum size of $M$ (Max M)
that will yield desirable results, three different sizes of $M$ were tried: Max M = 1/5 of the
test length, Max M = 1/4 of the test length, and Max M = 1/3 of the test length.
Similarly, to determine the minimum size of factor loading that should be used for
assigning an item to AT1, three different "starting" values (Start) of factor loadings were
tried: Start = .25, Start = .20, and Start = .15. An experimental design was set up for
conducting simulations with all three sizes of Max M and with all three values of Start. For
each combination of Max M and Start, both type-I error and power were observed over
repeated trials of DIMTEST with tests of different types. To illustrate, let Max M = 1/5
and Start = .25. Based on the loadings of the second factor, items with absolute loading
greater than .25 are to be considered for AT1 selection. The average item loading is
computed for items with positive loadings and for items with negative loadings. The set
with the highest average loading, in absolute value, is selected for AT1 and the size of this
set determines $M$. If the minimum required number of items is not obtained with either
positive or negative loadings, the start value is decreased by .05 until the minimum number
of items is found. Similarly, if in the selected set more than 1/5 of the items have absolute
loadings greater than .25, only 1/5 of the items with the highest loadings are included.
Algorithm 2 in the Appendix describes this procedure in detail. The observation of type-I
error and power for different values of Max M and Start revealed that Max M = 1/4 and
Start = .15 yielded the most desirable results. These values were then used for selection of
items in simulations reported in Tables 3 through 7 of this paper. Other combinations
of Max M and Start yielded either an observed type-I error rate that is too high or an
observed power level that is too low.
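The selection rule described above can be sketched as follows. This is one reading of the prose, not a transcription of Algorithm 2: only sign groups with at least the minimum number of items are treated as candidates, Max M is interpreted as a fraction of the total test length, and the function name and return value are invented for illustration.

```python
def select_at1(loadings, start=0.15, max_frac=0.25, min_items=4, step=0.05):
    """Choose AT1 items from second-factor loadings.

    Returns (sorted item indices, start value actually used); M is the
    number of indices returned. Candidate handling is an assumption.
    """
    max_m = int(len(loadings) * max_frac)     # Max M as a fraction of test length
    while True:
        pos = [i for i, l in enumerate(loadings) if l > start]
        neg = [i for i, l in enumerate(loadings) if l < -start]
        candidates = [s for s in (pos, neg) if len(s) >= min_items]
        if candidates or start <= 0:
            break
        start = round(start - step, 10)       # lower the threshold by .05
    if not candidates:
        raise ValueError("no sign group reaches the minimum number of items")
    # Keep the sign group with the larger average absolute loading.
    best = max(candidates, key=lambda s: sum(abs(loadings[i]) for i in s) / len(s))
    # Cap the size at Max M, keeping the largest absolute loadings.
    best = sorted(best, key=lambda i: abs(loadings[i]), reverse=True)[:max_m]
    return sorted(best), start


# Hypothetical loadings for a 20-item test.
loadings = [0.40, 0.35, 0.30, 0.28, 0.22, 0.18, 0.05, -0.02, -0.10, -0.12,
            -0.20, -0.25, 0.01, 0.03, -0.05, 0.08, 0.10, -0.08, 0.02, -0.01]
at1_items, used_start = select_at1(loadings)
```

With these loadings six items exceed .15 on the positive side, only two on the negative side, and Max M = 1/4 of 20 caps the selection at the five largest loadings, so $M = 5$.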



Standard Error Estimation in Stout's Statistic 



The general approach used in the development of Stout's statistic first derived an 
asymptotically valid test statistic and then made adjustments to optimize the 
pre-asymptotic behavior of the statistic, guided by Monte Carlo simulations.

Stout's statistic to test the hypothesis of essential unidimensionality was built by
combining information measuring the strength of evidence of the nonunidimensionality
contributed by each of the $k = 1, \ldots, K$ subgroups of examinees. That is, the goal was to
construct a statistic using the quantities

$$X_k = \hat\sigma_k^2 - \hat\sigma_{U,k}^2 \qquad (5)$$

from the $K$ subgroups of examinees. Each $X_k$ measures nonunidimensionality in the sense that
$X_k \approx 0$ when $d_E = 1$, and $X_k > 0$ on average when $d_E > 1$. The most obvious approach is to
add up the contributions of $X_k$ and then normalize this sum by an appropriate standard
error of estimate. When unidimensionality holds, Stout (1987) found the estimated
asymptotic variance of $X_k$ to be



leading to the statistic

$$T' = (T_L' - T_B')/\sqrt{2},$$

where

$$T_L' = \frac{\sum_{k=1}^{K} X_k}{\sqrt{\sum_{k=1}^{K} S_k'^2}}. \qquad (7)$$

Result 6.1 of Stout (1987, page 599) suggests that under regularity conditions when $d_E = 1$,
$T_L'$ and $T_B'$ should be asymptotically N(0,1) as the number of examinees and the number
of items both approach $\infty$. Moreover, Result 6.4 of Stout (1987, page 601) states that both
$T_L'$ and $T'$ should have asymptotic power one when $d_E > 1$.

Simulation studies conducted prior to the study reported in Stout (1987) showed
that, for test lengths and examinee population sizes typically encountered in practice, the
statistical test $T'$ falsely rejected the hypothesis of unidimensionality more frequently
than the nominal error rate. Two modifications for constructing $T$ of (4) were then
considered: (a) enlarge $S_k'$ to $S_k$ of (3) so that the values of $T$ on the average would be
smaller, thereby reducing the rate of occurrence of type-I error to close to or even below
the nominal level, and (b) normalize each $X_k$ by its estimated standard error and then sum
(instead of first summing and then normalizing as in (7)). This modified statistic $T$ was
used in simulation studies reported in Stout (1987). However, the observed average type-I
error (.023) in Stout (1987, Table 2) was well below the nominal level ($\alpha = .05$).

Because $S_k'$ yielded too large an observed type-I error and $S_k$ yielded too small an
observed type-I error, the following adjustment to the estimated standard error was
considered in addition to $S_k$ in the present study.



Furthermore, a basic question in constructing the statistic $T$ was how to combine 
the building blocks $X_k$ of (5) into a single appropriately normalized statistic for testing for 
unidimensionality. That is, restricting attention to linear scoring, the search was for an 
appropriate choice of weights $\{a_k, 1 \le k \le K\}$ to form $\sum_{k=1}^{K} a_k X_k$. Three different 
weighting procedures were considered. Six new statistics, $T_1$ through $T_6$, as described 
below, were derived as a result of using different weights and standard errors of estimate. 
The objective was to find an improved statistic with an increased observed type-I error to 
approximate the nominal level while maintaining or even improving the power. 

An estimator or test statistic is useful provided it centers on the appropriate 
parameter and has a small standard error. It can be shown that $\mathrm{Var}(\sum_{k=1}^{K} a_k X_k)$ is 
minimized, subject to the constraint $\sum_{k=1}^{K} a_k = 1$, by setting 
$a_k = [1/\mathrm{var}(X_k)] / \sum_{j=1}^{K} [1/\mathrm{var}(X_j)]$. Based on this argument, the statistic 
$T_1 = (T_{L,1} - T_{B,1})/\sqrt{2}$ was constructed where 

[Equation (8) is not legible in the source.]

It can be seen that 

[Equation (9) is not legible in the source.]
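
The minimum-variance argument for the weights can be checked numerically. The sketch below assumes the $X_k$ are independent, so that $\mathrm{Var}(\sum_k a_k X_k) = \sum_k a_k^2\,\mathrm{var}(X_k)$; the variances are made-up illustrative values and the function names are hypothetical.

```python
import random


def inverse_variance_weights(variances):
    """Weights a_k proportional to 1/var(X_k), normalized to sum to 1."""
    inv = [1.0 / v for v in variances]
    total = sum(inv)
    return [w / total for w in inv]


def weighted_sum_variance(weights, variances):
    """Var(sum a_k X_k) for independent X_k (independence is an assumption here)."""
    return sum(a * a * v for a, v in zip(weights, variances))


variances = [0.5, 1.0, 2.0, 4.0]          # illustrative values only
optimal = inverse_variance_weights(variances)
best = weighted_sum_variance(optimal, variances)

# No randomly drawn set of weights summing to 1 should beat the optimum.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in variances]
    total = sum(raw)
    trial = [r / total for r in raw]
    assert weighted_sum_variance(trial, variances) >= best - 1e-12

print(optimal, best)
```

With these weights the minimized variance equals $1/\sum_k [1/\mathrm{var}(X_k)]$, which is never larger than the smallest individual variance.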



The statistic $T_2$ was constructed similarly to the statistic $T$ of (4) but with the adjusted 
standard error, in place of $S_k$, as the estimated standard error. That is, $T_{L,2}$ is given by 

[Equation (10) is not legible in the source.]

The statistic $T_3$ was constructed with weights as in $T_1$ but with the adjusted standard 
error as the estimated standard error. That is, $T_{L,3}$ is given by 

[Equation (11) is not legible in the source.]

Based upon the naive, intuitive idea that those subgroups with more examinees in 
them should receive more weight in the constructed statistic, two more definitions, $T_4$ and 
$T_5$, were proposed, where 

[Equations (12) and (13) are not legible in the source.]

respectively. 

Lastly, based upon the Central Limit Theorem and contrasted with the statistic $T$ of 
(3), the statistic $T_6$ was derived where 

[Equation (14) is not legible in the source.]

In summary, Stout's (1987) recommended statistic $T$ as well as statistics $T_1$ and 
$T_6$ use $S_k$ as the estimated standard error, and the statistics $T_2$, $T_3$, $T_4$, and $T_5$ use the 
adjusted standard error as the estimated standard error. The statistics $T_1$ and $T_3$ use 
weights according to the principle of minimum variance with $S_k$ and the adjusted standard 
error as the standard errors of estimate, respectively. The statistics $T_4$ and $T_5$ use weights 
$J_k$ (the number of examinees in each cell) and $\sqrt{J_k}$, respectively, with the adjusted 
standard error as the standard error of estimate. And finally the statistic $T_6$ is based 
on the usual form of the Central Limit Theorem (Note 9). 

We decided that statistics $T_i = (T_{L,i} - T_{B,i})/\sqrt{2}$, $i = 1, \ldots, 6$, with different weights 
and standard errors should provide an ample choice of statistics for a simulation study to 
assess whether an improved statistic can be obtained that would be better than using $T$. 



Monte Carlo Simulation Studies 



A Monte Carlo simulation study was undertaken to study the performance of 
DIMTEST after performing corrections for high-discrimination bias using the Wilcoxon 
rank sum test, automation of the size $M$ of assessment subtests, and correction for the 
standard error of estimate. In all simulations, $J_{\min} = 20$ was adopted. The simulation 
study was designed to be similar to Stout's (1987) study in order to compare the 
performance of the statistic before and after the proposed corrections. 

Two issues were of particular importance in the study: (a) how well the nominal 
level of significance specified by the user ($\alpha = .05$) is approximated by the observed level of 
significance when $d_E = 1$ (Note 10), and (b) how large the power of the statistical test was in 
various $d_E = 2$ settings. 

The Preliminary Standard Error Study 

In a preliminary pilot simulation study, the performance of six different statistics, 
$T_1$ through $T_6$, was studied and compared with $T$ (after implementing corrections for 
high discrimination with guessing and automated $M$) with respect to type-I error and 
power in various test settings. The results revealed that the statistic $T_2$ yielded a higher 
observed type-I error, closer to the nominal level, and a higher power than $T$. The statistic 
$T_6$ yielded an unacceptably large type-I error; statistics $T_1$, $T_3$, $T_4$, and $T_5$ differed little 
in performance from $T$ and thus would offer no advantage. Therefore, the statistic $T_2$ was 
used in the simulations described below, and the results were compared with simulations of 
Stout (1987), obtained by using the statistic $T$ prior to the proposed corrections. That is, $T$ 
used $S_k$ and did not correct for high-discriminating/guessing items, nor did it 
automatically select $M$. By contrast, $T_2$ used the adjusted standard error, corrected for 
high-discriminating/guessing items, and automatically selected $M$. 

The Unidimensional Simulation Study 

The unidimensional, three-parameter logistic model was used to simulate the test 
data. In order for the simulated test data to reflect real data, item parameter estimates 
were obtained from real data sets for five different tests: SATV, ACTM, ACTE, ASVAB 
AS, and ASVAB AR (Note 11). The distributions of item parameters for these five tests are given in 
Table 2, and show that the five tests differ not only in length but also in distribution of 
difficulty and discrimination parameters. For example, ACTE has the lowest mean and 
standard deviation of item discrimination parameters; ASVAB AR has the highest mean 
item discrimination; and ASVAB AS has the highest standard deviation of item discrimination. 
For each test type, two examinee sample sizes $J$ were studied: 750 and 2000. With the 
sample size of 750, 250 examinees were used for factor analytic selection of assessment 
items, while the remainder were used to compute the test statistic. With $J = 2000$, 500 
examinees were used for the factor analysis and the remainder were used for computing the 
statistic. 



Table 2 



Binary item responses were generated as explained below. Examinee abilities were 
randomly generated from the standard normal distribution. For each simulated examinee, 
the probability, $P_i(\theta)$, of correctly answering each item was computed using the 
three-parameter unidimensional logistic model. If a uniform random deviate in the interval 
(0,1) was less than or equal to the computed probability $P_i(\theta)$, the examinee was 
considered to have answered the item correctly and was given a score of 1; otherwise, the 
examinee was given a score of 0. 
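
The response-generation steps above can be sketched as follows. This is an illustrative sketch, not the simulation code used in the study: the function names are hypothetical, the item parameters are made-up values, and the scaling constant 1.7 in the logistic function is an assumption (a common convention for the three-parameter logistic model that this section does not state).

```python
import math
import random


def p_correct(theta, a, b, c, scale=1.7):
    """Three-parameter logistic probability of a correct response.

    ASSUMPTION: scale=1.7 is a conventional constant, not taken from the text.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-scale * a * (theta - b)))


def simulate_responses(items, n_examinees, seed=0):
    """Return an n_examinees x n_items matrix of 0/1 item scores."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)          # ability from the standard normal
        row = []
        for a, b, c in items:
            u = rng.random()                 # uniform deviate on [0, 1)
            row.append(1 if u <= p_correct(theta, a, b, c) else 0)
        data.append(row)
    return data


# Hypothetical (a, b, c) triples for a three-item test.
items = [(1.07, 0.58, 0.16), (0.72, 0.03, 0.15), (1.46, -0.02, 0.19)]
responses = simulate_responses(items, n_examinees=750)
print(len(responses), len(responses[0]))
```

Because the lower asymptote is $c$, even very low-ability simulated examinees answer an item correctly with probability at least $c$, which is how guessing enters the generated data.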

For each combination of test type and examinee sample size, DIMTEST (as here modified 
by the Wilcoxon rank sum test, automated $M$, and the alternate standard error of estimate) 
was replicated 100 times, with new examinee responses being simulated each time. 
The number of rejections out of 100 replications of testing the null hypothesis of essential 
unidimensionality is reported in Table 3. Because the test data are generated from a 
unidimensional model, the observed level of significance should be close to the nominal 
level, which was set to .05. 

Table 3 and Table 4 

Table 3 shows the observed type-I error for all five simulated test types for 
different sample sizes. Of particular interest is the second column: rejection rates for the 
SATV high-discriminations. Contrasting these results with the rejection rates of Table 1 
shows that, with the proposed correction for excess bias (that is, the Wilcoxon rank sum 
test), the rejection rates have dropped to an acceptable level. For example, the rejection 
rate with 2000 examinees has dropped from 58 to 7. For other test types, the observed level 
of significance is also close to the nominal level. Table 4 compares these results with those 
of Stout (1987, Table 2) where the statistic $T$ was used. The contents of Table 4 show that, 
as a consequence of the proposed refinements, the observed type-I error rate has increased 
or remained the same for all test types and sample sizes except for ASVAB AR with 2000 
examinees. The overall average observed type-I error has increased from .023 (Stout, 1987) 
to .045 and is very close to the nominal value of .05. In addition, for each one of the cell 
entries, there is no statistical evidence to reject the hypothesis that the nominal level of 
significance of .05 holds. That is, they are all consistent with a true rejection probability 
of $p = .05$ (Note 12). 
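
The consistency claim can be verified with a short calculation. The sketch below uses the normal approximation to the binomial with a two-sided 5% test (the 1.96 multiplier is an assumption about how the acceptance region in the Notes was obtained) and the twelve rejection counts listed in Table 3.

```python
import math

n_trials = 100
p_nominal = 0.05

# Standard error of the number of rejections under H0: p = .05.
se = math.sqrt(n_trials * p_nominal * (1.0 - p_nominal))

# Two-sided acceptance region for the count, normal approximation (assumed 1.96).
lower = n_trials * p_nominal - 1.96 * se
upper = n_trials * p_nominal + 1.96 * se

# Rejection counts per 100 trials from Table 3 (J = 750 row, then J = 2000 row).
table3_counts = [6, 8, 5, 6, 2, 3, 6, 7, 4, 4, 2, 1]
all_inside = all(lower < count < upper for count in table3_counts)

print(round(se, 1), round(lower, 1), round(upper, 1), all_inside)
```

The standard error comes out at about 2.2 trials and the region at about (.7, 9.3), so every Table 3 entry, including the extremes 1 and 8, lies inside it.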

The Two-Dimensional Simulation Study 

The two-dimensional simulation study was modeled according to the 
multidimensional three-parameter logistic model with compensatory abilities (Reckase & 
McKinley, 1983) given by: 

[Equation (15) is not legible in the source.]



Seven different test types were considered to study the power of the procedure after 
the proposed changes. Two-dimensional counterparts of the five test types used in the 
unidimensional simulation study were simulated in the following manner. The 
discrimination parameters $(a_{1i}, a_{2i})$ of the two dimensions for each item were 
independently generated from a normal distribution with mean $\mu/2$ and standard 
deviation $\sigma/\sqrt{2}$, 
where $\mu$ and $\sigma$ were the mean and the standard deviation of the distribution of 
discrimination parameters of the respective unidimensional test taken from Table 2. 
Likewise, $b_{1i}$ and $b_{2i}$ were assumed to be independent of each other for each item and were 
generated from a normal distribution with mean $\mu$ and standard deviation $\sigma$, 
where $\mu$ and $\sigma$ were the mean and the standard deviation of the distribution of difficulty 
parameters of the respective unidimensional test taken from Table 2. For example, to 
generate the two-dimensional counterpart of the SATV test, the $a_{1i}$'s and $a_{2i}$'s were 
generated independently from the normal distribution with mean 1.07/2 and standard 
deviation $.4/\sqrt{2}$. Similarly, the $b_{1i}$'s and $b_{2i}$'s were generated independently from the 
normal distribution with mean .58 and standard deviation .88. Each test was taken to 
consist of approximately one-third "pure" items dependent on $\theta_1$ alone, one-third "pure" 
items dependent on $\theta_2$ alone, and one-third mixed items dependent on $\theta_1$ and $\theta_2$. 

Abilities $\theta = (\theta_1, \theta_2)$ were generated from a bivariate normal distribution with 
both means being zero and both variances being one. The correlation coefficient $\rho$ between 
the abilities varied appropriately. The $c$-parameter was taken to be .20 for all items. 
Binary item responses were generated exactly as described for unidimensional tests using 
(15). 
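
The generation recipe above can be sketched in code. Several parts are assumptions, flagged in the comments, because equation (15) is not legible here: the compensatory form with one difficulty per dimension, the scaling constant 1.7, setting the unused discrimination of a "pure" item to zero, and rounding one-third of the test length to obtain the number of pure items. All function names are hypothetical.

```python
import math
import random


def make_items(n_items, mu_a, sd_a, mu_b, sd_b, c=0.20, seed=0):
    """Generate (a1, a2, b1, b2, c) for a two-dimensional counterpart test."""
    rng = random.Random(seed)
    n_pure = round(n_items / 3)              # ASSUMPTION: rounding rule
    items = []
    for i in range(n_items):
        a1 = rng.gauss(mu_a / 2, sd_a / math.sqrt(2))
        a2 = rng.gauss(mu_a / 2, sd_a / math.sqrt(2))
        b1 = rng.gauss(mu_b, sd_b)
        b2 = rng.gauss(mu_b, sd_b)
        if i < n_pure:                       # "pure" theta_1 item
            a2 = 0.0                         # ASSUMPTION: zero loading on theta_2
        elif i < 2 * n_pure:                 # "pure" theta_2 item
            a1 = 0.0                         # ASSUMPTION: zero loading on theta_1
        items.append((a1, a2, b1, b2, c))
    return items


def draw_abilities(rho, rng):
    """Standard bivariate normal abilities with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return z1, rho * z1 + math.sqrt(1.0 - rho * rho) * z2


def p_correct(theta1, theta2, item, scale=1.7):
    """ASSUMED compensatory form: abilities combine additively in the exponent."""
    a1, a2, b1, b2, c = item
    z = scale * (a1 * (theta1 - b1) + a2 * (theta2 - b2))
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def simulate(items, n_examinees, rho, seed=1):
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        t1, t2 = draw_abilities(rho, rng)
        data.append([1 if rng.random() <= p_correct(t1, t2, it) else 0
                     for it in items])
    return data


# SATV counterpart: discrimination mean 1.07, SD .40; difficulty mean .58, SD .88.
items = make_items(50, mu_a=1.07, sd_a=0.40, mu_b=0.58, sd_b=0.88)
responses = simulate(items, n_examinees=750, rho=0.5)
```

Halving the mean and dividing the standard deviation by $\sqrt{2}$ means that, for a mixed item, the sum $a_{1i} + a_{2i}$ has the same mean and standard deviation as the discrimination parameter of the unidimensional test.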

In addition to the five two-dimensional counterparts of unidimensional tests, two 
more tests, the ACT Mathematics Usage Form 8B (ACTM8B) and the ACT Mathematics 
Usage Form 24B (ACTM24B), were used. For these two tests, estimated two-dimensional 
item parameters $(a_{1i}, a_{2i})$ and $(b_{1i}, b_{2i})$ were obtained from the American College Testing 
Program. Except for item parameter generation, which has been replaced by use of actual 
item parameter estimates, the responses for these two tests were simulated as described 
above. 

For each of the seven test types, two examinee sample sizes $J$ were considered (750 
and 2000) and two levels of correlation $\rho$ were considered (.5 and .7). As in the 
unidimensional study, when $J = 750$, 250 examinees were used for factor analysis, and, when 
$J = 2000$, 500 examinees were used for factor analysis. For each combination of test type, 
examinee sample size, and level of correlation, DIMTEST (as modified by the Wilcoxon 
rank sum test, automated $M$, and the alternate standard error of estimate) was 
replicated 100 times, each time simulating new examinees. For the first five test types, a 
new set of item parameters was generated for each test after each 10 replications. The 
number of rejections over 100 replications is reported in Table 5 for each case. 



Table 5 and Table 6 

In the case of $d_E = 2$, one wants good power; that is, one wants the rejection 
probability to be near one for a broad range of realistic $d_E = 2$ alternatives. The contents 
of Table 5 show that the power is extremely high for the case of $\rho = .5$ for both sample 
sizes. The power is very high for $\rho = .7$ with 2000 examinees, and the power is good for 
$\rho = .7$ with 750 examinees. These 
results are noteworthy, considering that all tests in the simulation study consist of at least 
one-third mixed items requiring knowledge of both traits to be answered correctly. 
Furthermore, it can be seen that as the sample size increases, the power also increases. 

Table 6 compares the results of the present study with the results of Stout's 
simulation study, which used the statistic $T$ (1987, Table 6). It can be seen, as a 
consequence of the proposed refinements, that the power has increased for every test type, 
sample size, and level of correlation. On the average, power has gone up from 67 to 88 
rejections per 100 trials of the procedure for the case of $\rho = .5$ with 750 examinees, from 92 
to 99 rejections for the case of $\rho = .5$ with 2000 examinees, from 36 to 54 for the case of 
$\rho = .7$ with 750 examinees, and from 67 to 90 rejections for the case of $\rho = .7$ with 2000 
examinees. These average increases are large enough to be of practical importance. 
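
The averages quoted for the present study can be reproduced from the Table 5 entries for the five generated test types (SATV, ACTM, ACTE, ASVAB AS, ASVAB AR). A minimal check:

```python
# Rejections per 100 trials from Table 5, keyed by (rho, J), for the five
# generated two-dimensional counterpart tests.
table5 = {
    (0.5, 750): [93, 97, 81, 73, 94],
    (0.5, 2000): [100, 100, 100, 99, 98],
    (0.7, 750): [58, 66, 37, 50, 61],
    (0.7, 2000): [96, 97, 83, 83, 91],
}

averages = {key: sum(values) / len(values) for key, values in table5.items()}
rounded = {key: round(avg) for key, avg in averages.items()}
print(rounded)
```

The rounded averages are 88, 99, 54, and 90, which agree with the figures given in the text for the present study.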

Real Data Study 

Four different data sets were used to examine the performance of DIMTEST on 
actual data. Data for two Armed Services Vocational Aptitude Batteries, used by the 
Department of Defense Student Testing Program in high schools and post-secondary 
schools, were obtained from Linn, Hastings, Hu, and Ryan (1987). These tests included 
Arithmetic Reasoning tests for Grades 10 and 12 (AR10 & AR12), each with 30 items and 
1984 and 1961 examinees, respectively. Two more data sets were obtained from the American 
College Testing (ACT) Program. These included ACT mathematics usage Forms B and C 
(F29B & F29C), each with 40 items and 2491 and 2494 examinees, respectively. 
DIMTEST was applied to each of the four data sets. In each data set, 500 examinees 
were randomly selected for factor analysis; the rest were used for computing the statistic. 
Examinees were randomly split into two groups, one group for performing factor analysis 
and the other for computing the statistic, 100 times, each time testing for the null 
hypothesis of essential unidimensionality. The number of rejections over 100 replications of 
the procedure was noted. The results for all tests are tabulated in Table 7. 



Table 7 



The contents of Table 7 suggest that, according to DIMTEST, AR10 and AR12 
should be assessed as essentially unidimensional tests while F29B and F29C should be 
assessed as multidimensional tests. Examination of items of F29B and F29C showed that 
these tests consist of items assessing knowledge of arithmetic and algebra operations, 
geometry, numeration, story problems, and advanced topics. Therefore, from the 
perspective of content, F29B and F29C would seem to be multidimensional tests measuring 
highly correlated abilities. The rejection rate for AR12 is slightly higher than expected for 
an essentially unidimensional test. One or two items highly influenced by another factor 
may contribute to this high rejection rate, or many items may be slightly influenced by a 
second factor. Further investigation is necessary to examine possible reasons. 

Summary and Discussion 

Detailed investigation of DIMTEST for assessing unidimensionality revealed certain 
limitations. It failed to perform desirably when the test consisted of predominantly 
difficult, high-discrimination items coupled with guessing. This limitation was 
overcome by a more appropriate selection of assessment items. Also, an automated 
approach was devised to determine the size of assessment subtests, and the estimate of the 
standard error of the statistic was adjusted to yield the desired level of significance for 
$d_E = 1$ data and higher power for $d_E > 1$ data. After the proposed refinements were 
implemented, DIMTEST was applied to a variety of simulated tests for different sample 
sizes; these tests were modeled on Stout's (1987) simulation study. 

Comparison of the results of the present study with the results of Stout's (1987) 
study indicates that the proposed refinements have improved the observed level of 
significance. It is now close to the nominal level for $d_E = 1$ simulations, and the refinements 
have considerably increased the power for $d_E = 2$ simulations for different levels of 
correlation and sample sizes. In addition, the procedure has been used on a number of real 
data sets. The results of the real test data study seem to confirm the a priori hypotheses 
regarding the dimensionality of these tests (Note 13). 

The refinements have led to a revised test procedure that is, in particular, more 
robust against unusually high-discrimination parameters with guessing present and that, in 
general, is able to perform more desirably with respect to type-I and type-II errors. 
Moreover, the procedure is automated and totally data-dependent in its selection of 
assessment subtest items, making it more user friendly. The automation of the size of the 
assessment subtests could especially benefit the novice user. Because the power of the 
statistical test heavily relies upon appropriate selection of items for AT1, our simulation 
study provides further evidence that the use of linear factor analysis for selection of these 
items is a promising approach that requires little effort on the part of the user. 

When the statistical test rejects the null hypothesis of essential unidimensionality, 
it is possible to proceed in several ways. One approach would be to reexamine the test and 
assess the complexity of the essential multidimensionality present using DIMTEST, 
NOFA, and so forth. If inference suggests that each of the different dominant traits 
influences a distinct group of items (i.e., there is a pronounced simple structure), the test 
could be split into several essentially unidimensional subtests, and each one could be 
analyzed separately using unidimensional IRT models. Alternately, if most of the items of 
the test are each influenced simultaneously by several dominant dimensions, then the 
researcher may need to resort to multidimensional parametric models in order to make 
inferences about the test data (Reckase, 1985, 1989). 

The dimensionality of a set of item responses is conceptually very complex. It is a 
function of items, examinees, and extraneous factors such as type of instruction and stage 
of learning. Also, dimensionality is, from the practical perspective, a continuum. Because 
items are multiply determined, among finite length tests (the only kind available in 
applications), there is no such thing as a strictly unidimensional test. But we can still 
describe a given set of item responses as being well modeled by an essentially 
unidimensional test model. Junker (1990, 1991) argues that an index for the continuum of 
dimensionality should be developed with strict unidimensionality, in the sense of fitting 
local independence models, on one end and strict essential multidimensionality on the other 
end, with essential unidimensionality in between. Junker and Stout (1991) have developed 
indices for lack of essential unidimensionality, which can be extremely useful for assessing 
the degree of lack of essential unidimensionality when Stout's test of $d_E = 1$ is rejected. 
Additionally, these indices show when it is safe to use unidimensional estimation 
procedures such as LOGIST or BILOG to arrive at accurate ability estimates. The 
conjecture is that lack of strict unidimensionality is not detrimental, provided $d_E = 1$ 
modeling provides a good approximation to reality. The number of items influenced by the 
secondary dimensions, as well as the strength of the influence of secondary dimensions on 
each item, should determine how strong the lack of $d_E = 1$ is. Nandakumar (1991) has 
demonstrated the utility of DIMTEST in assessing essential unidimensionality when test 
items were influenced by various dimensions to various degrees, and thus strict 
dimensionality exceeded one. Nandakumar has found that the accuracy of the 
approximation of essential unidimensionality for a test is a function of the proportion of 
test items influenced by the various nondominant traits present and by the strength of the 
influence of these traits. 

Stout's procedure seems very promising for assessing the dimensionality underlying 
a set of items. It is an outgrowth of the conceptual definition of essential unidimensionality 
and was developed to be sensitive to dominant dimensions and insensitive to transient or 
minor dimensions. The procedure is nonparametric (thus avoiding parametric model-data 
fit problems), supported by an asymptotic theory, and computationally simple. 
However, the procedure is relatively new, and its applicability in a variety of realistic 
applications needs to be studied further. Software to run DIMTEST is available from the 
authors. 

Appendix 

Algorithm 1: Test for Difficulty Factor 

1. Rank the $N$ items from most difficult (rank 1) to easiest (rank $N$). 

2. Compute the sum $W$ of the ranks of the $M$ items in AT1. 

3. Compute the mean $E(W)$ and the standard deviation $SD(W)$ of the 
sum $W$ under the assumption of randomly distributed ranks: 

$E(W) = \tfrac{1}{2} M (N + 1)$ 

$SD(W) = \left(\tfrac{1}{12} M (N - M)(N + 1)\right)^{1/2}$ 

4. Compute the critical value $C$ for $W$ under the usual large sample 
approximation: 

$C = E(W) + z_\alpha \, SD(W)$ 

where $z_\alpha$ is the upper $100(1-\alpha)$th percentile of the standard normal 
distribution and $\alpha$ is the desired level of significance. 

5. If $W > C$, conclude that the $M$ items in AT1 are too easy. 
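
The steps of Algorithm 1 can be sketched as below. This is an illustration rather than the DIMTEST implementation: the function name and the example ranks are hypothetical, and the default $\alpha = .05$ is an assumption. The normal quantile comes from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def difficulty_factor_test(at1_ranks, n_items, alpha=0.05):
    """One-sided rank-sum check of whether the AT1 items are too easy.

    at1_ranks: difficulty ranks of the M items in AT1, where rank 1 is the
    most difficult item and rank n_items is the easiest.
    Returns (W, C, too_easy).
    """
    m = len(at1_ranks)
    w = sum(at1_ranks)                                         # Step 2
    mean_w = m * (n_items + 1) / 2                             # Step 3
    sd_w = math.sqrt(m * (n_items - m) * (n_items + 1) / 12)
    z_alpha = NormalDist().inv_cdf(1 - alpha)                  # upper percentile
    crit = mean_w + z_alpha * sd_w                             # Step 4
    return w, crit, w > crit                                   # Step 5


# Hypothetical example: N = 40 items, AT1 holds the 10 easiest items.
w, crit, too_easy = difficulty_factor_test(list(range(31, 41)), n_items=40)
print(w, round(crit, 2), too_easy)
```

In this example the rank sum is 355 against a critical value of roughly 257.7, so the check flags AT1 as too easy; an AT1 made of the ten most difficult items (ranks 1 to 10) would not be flagged.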
Algorithm 2: The Size M of Assessment Subtests 

Let $N$ = total number of items, Mlow = 4, Mhigh = $[N/4]$, and 
Maxload = .15. 

1. Compute 

a) $I_1$ = Number of positive loadings > Maxload, 

b) $I_2$ = Number of negative loadings < -Maxload. 

2. Redefine 

$I_1$ := min {Mhigh, $I_1$} 
$I_2$ := min {Mhigh, $I_2$}. 

3. If both $I_1$ < Mlow and $I_2$ < Mlow, then define 

Maxload := Maxload - .05. 
Go to Step 1. 

4. If either $I_1$ or $I_2$ is $\ge$ Mlow, then let 

$M$ = $I_1$ if $I_1 \ge$ Mlow 
$M$ = $I_2$ if $I_2 \ge$ Mlow 

5. If both $I_1 \ge$ Mlow and $I_2 \ge$ Mlow, then compute the averages Avg1 
and Avg2 of item loadings for sets corresponding to $I_1$ and $I_2$, 
respectively. Let 

$M$ = $I_1$ if Avg1 > Avg2 
$M$ = $I_2$ if Avg2 > Avg1 
$M$ = max {$I_1$, $I_2$} if Avg1 = Avg2 
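
Algorithm 2 can be sketched as follows. Several details are assumptions because the source is partly illegible: Mhigh is taken as the integer part of $N/4$, the averages in Step 5 are compared in absolute value (so that the negative-loading set is comparable with the positive one), ties are resolved with the larger set, and the search stops with an error once Maxload would fall below zero. Thresholds are tracked in hundredths to avoid floating-point drift. The function name is hypothetical.

```python
def choose_m(loadings, m_low=4, max_load_pct=15, step_pct=5):
    """Pick the AT1 size M from second-factor loadings (sketch of Algorithm 2)."""
    n = len(loadings)
    m_high = n // 4                          # ASSUMPTION: Mhigh = [N/4]
    while True:
        cut = max_load_pct / 100.0
        # Step 1: loadings beyond the threshold, largest magnitude first.
        pos = sorted((x for x in loadings if x > cut), reverse=True)
        neg = sorted((-x for x in loadings if x < -cut), reverse=True)
        # Step 2: cap each set at Mhigh items.
        pos, neg = pos[:m_high], neg[:m_high]
        i1, i2 = len(pos), len(neg)
        if i1 >= m_low or i2 >= m_low:
            break
        # Step 3: relax the threshold and try again.
        if max_load_pct - step_pct < 0:      # ASSUMPTION: stopping rule
            raise ValueError("no set of loadings reaches Mlow items")
        max_load_pct -= step_pct
    # Step 4: only one set is large enough.
    if i2 < m_low:
        return i1
    if i1 < m_low:
        return i2
    # Step 5: both sets qualify; prefer the larger average magnitude.
    avg1 = sum(pos) / i1
    avg2 = sum(neg) / i2                     # ASSUMPTION: absolute values
    if avg1 > avg2:
        return i1
    if avg2 > avg1:
        return i2
    return max(i1, i2)                       # ASSUMPTION: tie rule


# Hypothetical loadings for a 40-item test.
print(choose_m([0.3] * 6 + [-0.2] * 2 + [0.0] * 32))
```

In the usage line, six positive loadings exceed .15 and only two negative loadings fall below -.15, so Step 4 applies and the sketch returns 6.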
Notes 

1. Throughout, we speak of a unidimensional test, a unidimensional set of items, etc. This 
convenient phrasing represents the more complex reality that the dimensionality of a model 
or a data set rests on the joint influence of test items and examinee population. Items and 
examinees together produce test data that we judge by statistical inference to be 
unidimensional or not. Reckase (1990) writes perceptively on this point. Technically, IRT 
dimensionality is usually defined to be the lowest latent space dimension possible, such 
that monotonicity and local independence hold. 

2. Note that the statistic $T_L$ computed from AT1 is sensitive to dimensionality (that is, it 
can discriminate between $d_E = 1$ vs. $d_E > 1$) and to sources of bias. The idea in introducing 
AT2 is to deliberately make $T_B$ sensitive only to sources of bias but not to dimensionality. 

3. In unidimensional settings where the procedure worked well, typical values of $T_L$ ranged 
roughly from 1 to 5, and typical values of $T_B$ ranged roughly from .6 to 4.0; thus 
typical values of $T$ ranged roughly from -1.0 to 1.5. 

4. If a randomized block design with $M$ blocks of size 2 is to be used in an experiment with 
human subjects assigned to control and treatment groups, this experimental design 
technique will work well unless the subjects are too variable. By rough analogy, the higher 
the discrimination parameters, the more "variable" are the items that are being assigned to 
AT1 and AT2 and the less effective the difficulty matching method (analogous to blocking) 
of AT2 item selection is in eliminating bias. 

5. It can be observed that when items in AT1 are replaced (because they are too easy) with 
items of high loadings of the opposite sign, easy PT items could result, thereby causing 
inaccuracy in subgroup assignment of high-scoring examinees. Simulation results have 
shown that this potential inaccuracy is not as detrimental to the value of the statistic $T$ as 
it was when PT had mostly difficult items. 

6. We also tried to correct tetrachoric correlations for guessing by following Bock, Gibbons, 
and Muraki (1985) and by using nonlinear factor analyses to diminish the influence of 
difficulty on the second factor loadings. Regarding correction of tetrachorics, we found that 
when guessing values were about .2 in the model, a large percentage of the sample 
correlations was computed as 1 or -1. However, when the guessing levels were arbitrarily 
cut by half, the problem of extreme correlations was reduced. Even with this reduction of 
guessing levels, the items selected for AT1 did not differ significantly from those selected 
without correction for guessing. Moreover, the ad hoc method of cutting guessing levels 
defeats the purpose of using the three-parameter logistic model. Therefore correction for 
guessing was not implemented. 

The nonlinear factor extraction program NOFA was used to select items for the 
assessment subtests. We tried a two-factor quadratic model for this purpose. In comparing 
the results of linear and nonlinear factor analysis, we found no difference in $T$-values 
between the two methods. To our surprise the difficulty factor reappeared even with 
nonlinear factor analysis. Therefore, we did not implement nonlinear factor analysis. 

7. The reason for the word "suggests" instead of "establishes" is that Stout's result actually 
assumes unidimensionality under the stronger assumption of local independence. Further, 
the asymptotic variance in (6) also assumes the stronger assumption of local 
independence. 

8. $(S_k')^2$ is an asymptotic variance and fails to account for the overdispersion of $S_k$ that 
occurs as examinees in a fixed PT subgroup have varying abilities, even though the test is 
unidimensional. Thus, $(S_k')^2$ will underestimate the true standard error and will yield too 
large a type-I error (see Cox & Snell, 1989, pp. 106-110, for a discussion of 
overdispersion resulting from a varying parameter such as ability). 

9. There are, of course, many more possibilities for computing statistics with given weights 
and standard errors of estimate, but those described here were considered the most 
appropriate. 

10. Technically, our simulations were done with $d = 1$, implying $d_E = 1$. For simulation studies 
for which $d_E = 1$ with $d > 1$, see Nandakumar (1991). 

11. The SATV denotes the SAT-verbal test obtained from Lord (1968); ACTM denotes the 
ACT mathematics usage test, and ACTE denotes the ACT English Usage test, both 
obtained from Drasgow (1987); ASVAB AS and ASVAB AR denote the Armed Services 
Vocational Aptitude test Battery, Auto Shop Information and Arithmetic Reasoning 
respectively, both obtained from Mislevy and Bock (1984). 

12. The standard error for testing the hypothesis of $p = .05$ vs. $p \ne .05$ is approximately 2.2 
trials. Thus, the acceptance region of this test for a set of 100 simulations is given by (.7, 
9.3) trials. 

13. We say "seem to" because one cannot really know that a real data set is $d_E = 1$ or $d_E > 1$. 
Further, the 100 replications of Table 7 are not the result of 100 administrations of the test 
to similar examinee populations, but rather 100 variations of the application of the statistic 
to one data set that resulted from one administration of the test. 

Table 1 

Rejection rates per 100 trials for $d_E = 1$ simulation study using 
estimated item parameters of SAT verbal test with $\alpha = 0.05$ 

Discrimination parameter                    Number of    Number of examinees 
                                            items        750     1000    2000 

0 < $a_i$ < 1.0 (low discriminations)       41             4        0       3 
1.1 < $a_i$ < 2.0 (high discriminations)    39            28       46      58 
Table 2 

Sample distributions of item parameters for the five 
standardized tests used in the study 

                 SATV     ACTM     ACTE     ASVAB AS   ASVAB AR 

N*               ?        ?        ?        ?          30 
Max $a_i$'s      2.00     2.00     1.58     2.82       2.76 
Min $a_i$'s      0.40     0.40     0.11     0.32       0.50 
Mean $a_i$'s     1.07     ?        0.72     1.22       1.46 
S.D. $a_i$'s     0.40     0.35     0.25     0.70       0.51 
Max $b_i$'s      2.50     1.50     2.07     1.27       1.01 
Min $b_i$'s     -1.50    -1.02    -3.11    -1.39      -2.72 
Mean $b_i$'s     0.58     0.50     0.03     0.09      -0.02 
S.D. $b_i$'s     0.88     0.61     0.96     0.72       0.84 
Max $c_i$'s      0.20     0.21     0.27     0.26       0.34 
Min $c_i$'s      0.04     0.02     0.04     0.06       0.08 
Mean $c_i$'s     0.16     0.14     0.15     0.20       0.19 
S.D. $c_i$'s     0.05     0.04     0.03     0.04       0.06 

* N denotes the test length. 
? marks an entry that is not legible in the source. 

SATV denotes the SAT verbal test battery. 
ACTM denotes the ACT mathematics usage test battery. 
ACTE denotes the ACT English usage test battery. 
ASVAB AS denotes the Armed Services Vocational 
Aptitude Battery for auto shop information. 
ASVAB AR denotes the Armed Services Vocational 
Aptitude Battery for arithmetic reasoning. 



Table 3 

Results of unidimensional simulation study: Rejection rates for testing 
the null hypothesis of $d_E = 1$ over 100 trials with $c = .20$ and $\alpha = .05$ 

J       SATV*    SATV        ACTM    ACTE    ASVAB AS    ASVAB AR 
                 high dis 

750     6        8           5       6       2           3 
2000    6        7           4       4       2           1 

* SATV and ACTE each contain more than 50 items in the pool, but 50 items 
were randomly selected for the study. After each 10 of 100 trials a new 
sample of 50 items was chosen. For other tests the same test was used for 
all 100 trials. 
Table 4 

Comparison of unidimensional simulation study results of this paper with 
those in Stout (1987): Rejection rates for testing the null hypothesis of 
$d_E = 1$ over 100 trials with $c = .2$ and $\alpha = .05$ 

                 SATV         ACTM         ACTE         ASVAB AS     ASVAB AR 
Study            750  2000    750  2000    750  2000    750  2000    750  2000 

Stout* (1987)      2     6      1     4      3     1      1     1      2     4 
Present            6     6      5     4      6     4      2     2      3     1 

* For all tests the rejection rate reported is the average of rejection 
rates (rounded to nearest integer) for the two different M values 
reported in Table 2 of Stout (1987). 
Table 5 

Results of two-dimensional simulation study: Rejection rates for testing 
the null hypothesis of $d_E = 1$ over 100 trials with $c = .20$ and $\alpha = .05$ 

               SATV        ACTM        ACTE        ASVAB AS    ASVAB AR    ACTM24B   ACTM8B 
Items          17-17-16    13-13-14    17-17-16    8-8-9       10-10-10    0-0-40    0-0-50 
J              750  2000   750  2000   750  2000   750  2000   750  2000   2000      2000 

$\rho = .5$     93   100    97   100    81   100    73    99    94    98   99        100 
$\rho = .7$     58    96    66    97    37    83    50    83    61    91   69        98 

Items: number of pure $\theta_1$ items, pure $\theta_2$ items, and mixed items. 
Table 6 

Comparison of two-dimensional simulation study results of this paper with 
those in Stout (1987): Rejection rates over 100 trials for testing the 
null hypothesis $d_E = 1$ with $c = .2$ and $\alpha = .05$ 

                  SATV         ACTM         ACTE         ASVAB AS     ASVAB AR 
Items             17-17-16     13-13-14     17-17-16     8-8-9        10-10-10 
J                 750  2000    750  2000    750  2000    750  2000    750  2000 

$\rho = .5$ 
Stout* (1987)      62    98     69     ?     59    90      ?    87     76     ? 
Present            93   100     97   100     81   100     73    99     94    98 

$\rho = .7$ 
Stout* (1987)      36    83      ?    74      ?    55      ?    54      ?    67 
Present            58    96     66    97     37    83     50    83     61    91 

* For all tests the rejection rate reported is the average of rejection 
rates (rounded to nearest integer) for the two different M values reported 
in Table 6 of Stout (1987). 
? marks an entry that is not legible in the source. 



Table 7 

Results of real data study: Rejection rates for testing 
the null hypothesis of $d_E = 1$ over 100 replications of random 
selection of subjects with $\alpha = .05$ 

                 AR10     AR12     F29B     F29C 

N:               30       30       40       40 
J:               1984     1961     2491     2494 
Rejections       6        13       86       82 